Abstract
Introduction
Minimal residual disease (MRD) detection by flow cytometry is a powerful prognostic indicator in B acute lymphoblastic leukemia (B-ALL), but accurately differentiating normal precursor B cells from leukemic cells requires significant training and experience. This study presents an integrated machine learning pipeline that combines cell- and sample-level assessments to automatically classify B-ALL MRD samples.
Materials and Methods
List-mode data from 772 bone marrow samples analyzed for residual B-ALL using a two-tube, 6-color panel were collected from Johns Hopkins Hospital cases under an IRB-approved protocol; 386 were classified as MRD positive and 386 as MRD negative by an expert (MJB). Manual cell-type annotations from a subset of cases were used to train a hierarchical cell-level classification model, which served as a preprocessing step to exclude non-lymphocytes and mature B cells from the full dataset before a sample-level classification model was trained. Development of the integrated model comprised three main stages. First, data curation and quality check: raw flow cytometry data files (.fcs or .lmd), clinical interpretations (MRD-negative, MRD-positive <0.1%, 0.1–1%, and >1%), and expert cell-gating results were collected and reviewed to ensure eligibility for analysis. Second, model training: approved datasets and annotations were used to train and validate machine learning models. The sample-level model used previously established frameworks [1,2], such as GMM-SVM, to classify MRD status. The hierarchical cell-level model employed algorithms such as XGBoost for two successive binary classifications: first distinguishing lymphocytes from non-lymphocytes, then separating immature B cells from the remaining lymphocyte population. The trained cell-level model was then applied to filter out irrelevant cell populations before sample-level classification. Third, expert review and validation: model predictions were reviewed by experts using custom visualization and annotation tools, creating an efficient feedback cycle for model improvement.
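The two-stage cell-level filtering described above can be illustrated with a minimal sketch. It is not the study's implementation: the function name `filter_immature_b`, the 0.5 threshold, and the toy scoring functions are all assumptions made for illustration. In practice each scorer would be a trained binary classifier (the study mentions algorithms such as XGBoost) returning a probability per event.

```python
from typing import Callable, List, Sequence

Cell = Sequence[float]            # one event: a vector of channel values
Scorer = Callable[[Cell], float]  # returns a probability in [0, 1]


def filter_immature_b(
    cells: Sequence[Cell],
    lymphocyte_score: Scorer,
    immature_b_score: Scorer,
    threshold: float = 0.5,       # assumed cutoff, not taken from the study
) -> List[Cell]:
    """Keep only events that pass both binary classifiers in sequence."""
    kept: List[Cell] = []
    for cell in cells:
        # Stage 1: discard events scored as non-lymphocytes.
        if lymphocyte_score(cell) < threshold:
            continue
        # Stage 2: among lymphocytes, discard events scored as mature.
        if immature_b_score(cell) < threshold:
            continue
        kept.append(cell)
    return kept


# Toy stand-ins for trained models: each "event" simply carries its two
# probabilities, so the scorers read them back directly.
events = [(0.9, 0.8), (0.2, 0.9), (0.9, 0.1)]
kept = filter_immature_b(events, lambda c: c[0], lambda c: c[1])
print(kept)  # [(0.9, 0.8)]
```

Only the first event survives: the second fails the lymphocyte stage and the third fails the immature B-cell stage. The events retained this way would then be passed to the sample-level classifier.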
Results
The cell-level classification models demonstrated strong performance, achieving an area under the curve (AUC) of 99.96% and an accuracy of 99.19% for distinguishing lymphocytes from non-lymphocytes, and an AUC of 99.94% with an accuracy of 99.43% for differentiating immature B cells from mature lymphocytes.
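For readers less familiar with the two metrics reported here, the following sketch shows one standard way to compute them for a binary classifier. The helper names `accuracy` and `roc_auc` and the toy labels and scores are illustrative assumptions; the study does not state which software computed its metrics. AUC is computed as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (not from the study).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]

print(roc_auc(labels, scores))   # 0.75
print(accuracy(labels, preds))   # 0.5
```

The pairwise loop is quadratic in the number of events and is meant only to make the definition explicit; a rank-based implementation would be preferable for datasets with many cells.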
At the sample level, we evaluated MRD classification performance under different cell-filtering strategies. With initial cell filtering, the AUC increased from 76.5% to 82.45% and the accuracy from 68.79% to 74.74%. Removing the downsampling step from the previous framework and excluding mature lymphocytes based on cell-type classification improved performance further, to an AUC of 91.13% and an accuracy of 83.16%. The best performance was obtained when preprocessing combined a channel-wise transformation step with the exclusion of mature lymphocytes, reaching an AUC of 94.06% and an accuracy of 86.40%.
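The abstract does not specify which channel-wise transformation was applied. As a purely illustrative assumption, the sketch below uses an inverse hyperbolic sine (arcsinh) transform with a per-channel cofactor, a common choice for flow cytometry intensities. The function name `channelwise_arcsinh` and the cofactor value of 150 are hypothetical and are not taken from the study.

```python
import math
from typing import List, Sequence


def channelwise_arcsinh(
    events: Sequence[Sequence[float]],
    cofactors: Sequence[float],
) -> List[List[float]]:
    """Apply asinh(value / cofactor) to each channel of each event.

    Each channel has its own cofactor, so the transform is applied
    channel by channel rather than with one global scale.
    """
    transformed: List[List[float]] = []
    for event in events:
        if len(event) != len(cofactors):
            raise ValueError("each event needs one value per cofactor")
        transformed.append(
            [math.asinh(v / c) for v, c in zip(event, cofactors)]
        )
    return transformed


# Toy two-channel events; the cofactor of 150 is an assumed value.
raw = [[0.0, 150.0], [1500.0, -150.0]]
scaled = channelwise_arcsinh(raw, cofactors=[150.0, 150.0])
```

The transform is close to linear near zero and logarithmic for large values, so it compresses bright signals while remaining defined for the zero and negative values that compensation can produce.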
Discussion
These results demonstrate how integrated machine learning methods could deliver consistent automated analysis of B-ALL MRD. We next created an interactive review tool for evaluating discordant cases; it will enable iterative improvement of model performance, particularly in challenging scenarios such as low-level MRD or atypical immunophenotypes.